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1 Introduction 



One of the severe problems in lattice QCD from the practical point of view is 
the numerical inversion of sparse matrices. It nearly enters every Monte Carlo 
simulation, either in the quenched approximation of the theory in applying the 
inverse of the discretized Dirac operator on source vectors for the computation 
of quark propagators, or in full QCD with Hybrid Monte Carlo like algorithms, 
where a similar operation is necessary when calculating the fermionic force in 
order to include the quark field dynamics in the updating process of the gauge 
fields. 

In recent years substantial progress has been made by the use of Krylov sub- 
space iterative solvers in conjunction with preconditioning techniques. (See 
e.g. [1] for a textbook reference and the reviews [2-4], which contain an ex- 
tensive bibliography.) Popular choices for the inverter to be mentioned in the 
context of lattice simulations are the Conjugate Gradient (CG) algorithm [5,6], 
the Minimal Residual (MR) algorithm [7,8] and above all the stabilized Bi- 
Conjugate Gradient (BiCGStab) algorithm [9,10]. The latter now seems com- 
monly established as the most efficient solver in a vast majority of QCD ap- 
plications [11,12], particularly in the parameter region of small quark masses. 

In the area of preconditioning, any new (and potentially competitive) idea 
should be confronted with an even-odd (e/o) decomposition of the Dirac ma- 
trix [13,14], which both for ordinary Wilson fermions and together with an 
0(a) improvement term [15] has become the state-of-the-art method. An ear- 
lier step towards an alternative approach was the incomplete LU factorization 
[16] utilizing a g'/o6a//y-lexicographic ordering of the lattice points and thereby 
already related to SSOR. However, it turns out to be unsatisfactory when an 
implementation on grid-oriented parallel computers is envisaged [17]. Finally, 
the Wuppertal group invented a parallelization of symmetric successive overre- 
laxed (SSOR) preconditioning, a variant of the classical Gauss-Seidel iteration 
[1], resting upon a /oca/Zy-lexicographic ordering of the points within a given 
(sub-)grid of the total space-time lattice [18-22]. They showed that at least 
for Wilson fermions an SSOR preconditioner can perform much better that 
the standard e/o one. 

Since many comparative studies of the properties of the various inversion 
algorithms and preconditioning methods are already available in the literature 
[4,12,17], we restrict ourselves in the sequel exclusively to the case of the 0(a) 
improved Wilson-Dirac operator involving the Sheikholeslami-Wohlert clover 
term [23] within the Schrodinger functional of QCD [24,25]. 

The Schrodinger functional (SF) is defined as the partition function of QCD in 
a space-time cylinder of extension L^ x T with periodic boundary conditions 



in the space directions and (inhomogeneous) Dirichlet boundary conditions 
at times and T. As detailed e.g. in refs. [25,26], the SF has proven to be 
a valuable tool for computing the running coupling constant in a finite-size 
scaling analysis as well as for extracting phenomenological quantities from 
simulations in physically large volumes [27,28]. On the other hand, the (9(a) 
improved Wilson-Dirac operator is now a good candidate to probe continuum 
QCD by means of numerical simulations on the lattice: most coefficients mul- 
tiplying the counterterms required for a complete removal of the leading 0(a) 
lattice artifacts of action and quark currents are known non-perturbatively in 
the quenched approximation [25,29,30], and also for the action in the case of 
two flavours of dynamical quarks [31]. Therefore, a quantitative investigation 
of the performance of a (parallelized) SSOR preconditioner for the SF setup 
incorporating 0(a) improvement appears to be of natural interest. This is 
what we intend with the present communication. 

One might ask what should be different from periodic boundary conditions 
in space and time. At first, it is in principle not excluded that with Dirichlet 
boundary conditions a (specifically preconditioned) solver has generally lower 
iteration numbers. Another point concerns a definite advantage in the ac- 
tual implementation, since the SF allows to circumvent inefficient conditional 
statements as will be explained later. 



2 SSOR preconditioning 

The basic problem posed is to solve a system of linear equations of the type 

AX = Y ^ R = AX-Y = 0. (1) 

To fix some notation, small Greek letters denote scalars, capital Latin ones vec- 
tors with components in small letters, and matrices have calligraphic symbols; 
(X, Y) = J2i ^iVi is the usual scalar product. In lattice QCD, A represents the 
discretized Dirac operator, a huge sparse matrix of rank fi x 4 x 3 emerging 
from the nearest neighbour couplings of the quark and gluon field variables in 
position space after discretization of the interacting continuum theory. More 
precisely, Q is the volume of the four- dimensional space-time lattice, and the 
latter factors are connected to Dirac and SU(3) colour spaces. Physically, the 
solution of eq. (1) can be regarded as a fermionic Green function (i.e. a quark 
propagator) . 

SSOR preconditioning consists of solving the system 

V^^AV^^X = Y, X = V2X, Y = V^^Y (2) 



instead of AX = Y, where in the present context A stands for Q, the lattice 
Dirac operator of Wilson fermions supplemented with a local 0(a) improve- 
ment term aa^uF^u{,x), which is composed of the SU(3)-valued gauge link 
variables U{x,^) to form a clover leaf [23]: 



+ Csw - a K ^ (Tf^uF^u (x) 6x,y (3) 



with F^u{x) equal to 



U{x, fi)U{x + aft, u)U{x + au, fi)^U{x, u) ' 

+ U{x, v)U{x + aO — afi, fi)^U{x — afi, i')'^U{x — afi, fi) 

+ U{x — afi, fi)^U{x — aO — afi, v)^U{x — aO — afi, fi)U{x — aO, v) 

+ U{x — aO, i')^U{x — aO, fi)U{x — av + afi, v)U{x, ^)^ 

(4) 

Csw is an improvement coefficient non-perturbatively determined in quenched 
and two-flavour QCD [29,31]. (The slight modification to (3) induced by the 
Dirichlet boundary conditions in time direction are not written out here.) If 
one introduces the decomposition ^ 

Q = V-C-U (5) 



into block diagonal, block lower-triangular and block upper-triangular parts 
with respect to position space, the SSOR preconditioner is defined in terms 
of these matrices and a non-zero relaxation parameter u;ssor (which serves to 
reduce the iteration number) through the choice 

Vi={—^V-C](^^v] , V2 = —^V-U. (6) 

V^SSOR / VcJsSOR / ^SSOR 



Now it is advantageous to exploit in (2) the so-called 'Eisenstat trick' [32], 
embodied in the identity 



^ v{ V-C] {V-C-U){ V-U 



i^SSOR V^SSOR / Vf^SSOR 



^ The case of Wilson-Dirac fermions is recovered by setting T) to the unit matrix. 



.-1^-1 



= (l - u;ssoR'C'P ^) 1 1 + (t^ssoR - 2) (l - uJs^o^UV 

-^(l-uos^OKUV-^y\ (7) 

to save on computational costs: namely, it implies that any matrix-vector 
product with the preconditioned matrix Vf^QV2~^ essentially gives rise to a 
backward substitution and a subsequent forward substitution process, corre- 
sponding to the application of the (non-block) upper and lower triangular 
matrices 1 — ujq,sokUV~^ and 1 — cussoR'^^"^, respectively. These relations re- 
flect that SSOR splits any application of Q, which usually enters the inversion 
procedure associated with (1), into an equivalent but much simpler sequence 
of arithmetic operations with the set of matrices {P, C,U}. 

Then, combining the BiCGStab algorithm with SSOR preconditioning, an 
iterative numerical solution of the system in eq. (1) up to a given precision e 
is obtained by the prescription^ (preconditioned quantities carry a tilde): 

initialization with guess Xq : 

Ro = Y- QXo 
(1 — ujssoR^T^ )Ro = Rq 

R = Ro Po = 1 



k-th iteration (A; > 1) : 






Pk = {R,Rk-i) 




_ pkak-i 

Pk-i^k-i 



Pk — Rk-1 + /3-Pfc-l — P^k-lVk-l 

[1 - ussokUV-^)W = Pk, Z' = Pk + (o^ssoR - 2)W^ 
(1 - ujs^okCV-^)Z = Z\ Vk = W + Z 



Pk 
ak 



(R, Vk) 



S = Rk^i — otkVk 



^ The algorithm can straightforwardly formulated for other solvers, like MR for 
instance. Note, however, that in the chiral quark mass regime MR is generally of 
worse performance. 



(1 - oossokWD-^W = S, Z' = S+ (cc^ssor - 2)f/ 
(1 - ujssorCV-')Z = Z', f = U + Z 

{f,S) 
{T,T) 

Xk = Xk^i + uJkU + akW 

Rk = S — LOkT 
The stopping criterion to be imposed on the preconditioned residual is 

{Rk, Rk) , 2 

{Xk-,Xk) 



Some comments are in order. 

• Taking the forward substitution as example, one calculates recursively in 
terms of the (block) components Zj, Z[, Vij and Cij of Z, Z', V and C: 



V^: Z, = Z[ + Y.'^^^H,, H,=ujssor{v-'),,Z,\,^,_^. (9) 



jj " i<«-i 



I.e. thanks to the triangularity of C (and U) it can be done economically 
without involving a plain matrix multiplication, and owing to the sparsity 
of Q only a few j actually contribute to the sum. Provided that the inverses 
(D~^)^^ are pre-computed, the backward and forward solves together are 
exactly as expensive as one application of the whole matrix Q plus one 
additional V multiplication. 

In contrast to the unimproved case (where Xk = Xk follows immediately), 
the solution of the original system QX = Y is now Xk = cursor -^"^^fc • 
Choose Xq = = Xq as initial guess for the solution to avoid an application 
of "D, as part of Q, at the beginning, which yields Rq = Y—QXq = Y.liV is 
not needed elsewhere, this might be favourable in view of possible memory 
limitations of the hardware (e.g. setting an upper bound on the accessible 
lattice volumes), because we made the experience that savings in iteration 
number when using an available solution of a foregoing inversion as initial 
guess are generically negligible. 

Since in order to save on computational cost the stopping criterion (in our 
runs e^ = 10^^^, 10^^'^) is conveniently based on Rk, one might want to re- 
compute Rk at the end to test for convergence again and eventually — if 
the solution is not yet accurate enough — to continue the iteration with a 
stronger stopping criterion imposed on Rk- 



In the SSOR scheme the minimal number of vectors to be stored at each 
iteration /c is 9, if Rk-i and S share the same memory location, and if the 
source vector Y is overwritten by the iterative solutions X^. 
Due to the overall Dirac and colour structure of the fermion field variables 
at every lattice point, Q is genuinely partitioned into blocks of dimension 
12 X 12. The dependence of the speed of the SSOR preconditioner on the 
diagonal (sub-)block size of Vij was not studied. 



2. 1 Implementation for the Schrodinger functional 



For the implementation of SSOR preconditioning on a parallel computer, the 
ordering of the lattice points plays a key rolef^]- It determines the shape of Q via 
the pattern of its non-zero entries and thereby the degree of parallelism and the 
efficiency of the preconditioner. Via adapting a locally-lexicographic scheme 
we closely follow refs. [19,21], where different orderings and their consequences 
on the parallelization have been discussed. Hence we only describe the salient 
issues characteristic for the SF approach. 

Assume that a given space-time lattice is matched on the, say, three-dimen- 
sional grid of processing nodes of a parallel computer. Then each node occupies 
a local lattice, where three of its extensions are ratios of the total lattice sizes 
in the respective directions and the corresponding numbers of processors. The 
locally-lexicographic ordering ('colouring') is ensured by an alphabetic order- 
ing of the lattice sites belonging to these local lattices. Associating a colour 
with each fixed position within the local lattices, the original lattice is divided 
into groups, whose members couple to sublattice points of different colours 
only. In this way it becomes obvious that the forward and backward substi- 
tutions, e.g. to get Zi as in eq. (9), can be handled in parallel within the 
coloured groups, since for all positions of a given colour only their prede- 
cessors — in the lexicographic sense — enter the necessary computation^. 
Because among these lexicographically preceding sites, lattice points living on 
the neighbouring processors contribute too, the quantities Hi of (9) will have 
to be communicated from those processing nodes to the site in question. 

At this point we have to note that there is a crucial difference between a situ- 
ation with ordinary (anti-)periodic boundary conditions in all four space-time 
directions and our SF setup with Dirichlet boundary conditions in time. To 
see this difference in more detail, let us resort to an example in one dimen- 
sion. The four points in figure 1 define a one-dimensional lattice with periodic 



^ We remark that traditional e/o preconditioning can be interpreted as SSOR 
preconditioning of the even-odd ordered system [19]. 

^ So the general strategy would be to maximize the number of coloured groups, 
while maintaining its strengths in accordance with the desired parallelization. 



boundary conditions, and one can think of a 'coupling matrix' between the 
sites of mutual dependence. Since only nearest neighbour sites are coupled, we 

A B B C 



Fig. 1. A one-dimensional example. The direction of the arrows shows the data 
Bow: sites of kind B need information from site A, and so forth. 

immediately see that in applying eq. (9) one has to distinguish between three 
cases. (We restrict to the forward solve, because it is straightforward to work 
out the necessary changes for the backward solve.) 

A The point is on the 'left', i.e. it has coordinate x = 1: (9) simply becomes 

Zi = Z[. (10) 

B The point lies within the 'bulk', i.e. it has coordinate a; = 2, 3: (9) becomes 

Zx = Z^ + Cx^x-iHx-i , X = 2, 3 . (11) 

C The point is on the 'right', i.e. it has coordinate x = 4: then (9) becomes 

Zi = Z^ + Ci^sHs + Ci^iHi . (12) 

It is clear that we have those three cases for each dimension, in which ei- 
ther periodic or antiperiodic boundary conditions are prescribed. For a four- 
dimensional lattice this leads to 3*^ = 81 different cases to be handled, and it is 
natural to implement the algorithm with a do-loop over all the lattice points 
and some if -statements to discriminate between the 81 cases. The parallel 
version of the algorithm just described needs minor modifications: looking at 
figure 2, only the case C has to be replaced with 

Z, = Z', + C,,sHs + £4,ii/f "'* ""''^'"^ , (13) 

where now the coordinate index has a local meaning, labelling the sites inside 
the sublattice residing on a given processor, and also the processor's mesh is 
assumed to have periodic boundary conditions in the sense that the processor 
'next' to the last one in each direction (rightmost in figure 2) is to be identified 
with the first one in the same direction (leftmost in figure 2). 

Our numerical simulations were done on the 8-512 nodes APE- 100 massively 
parallel computers with cubic topology and nearest neighbour communication, 
built up of an array of elementary processing boards with 2x2x2 nodes 



A B B C ' 


A B B C 



Fig. 2. A one-dimensional parallel example. As in the non-parallel case, Ggure 1, 
the arrows represent the data Bow involved. 

[33,34]. They possess a SIMD (single- instruction multiple- data) architecture 
and are either suited to distribute a single, typically large, lattice over the 
whole machine or to simulate in parallel several independent copies of a lattice 
on subsets of the machine in the case of smaller volumes. 

With such a cubic topology it suggests itself to keep the whole time extent 
of the lattice within the processors, and to only split the spacelike volume 
into sublattice fractions with respect to the three-dimensional processor grid. 
If SF boundary conditions are adopted, one can make sure that, as far as the 
time direction with its first and last timeslices fixed to the boundary values is 
concerned, a site belongs always to the 'bulk'. This lowers the number of cases 
to be distinguished in order to implement the forward and backward solves 
to 3^ = 27. So it becomes rather near at hand to encode these cases explic- 
itly in a sensible arrangement of nested do-loops alone, i.e. via an outer loop 
over the full time coordinate and inner ones over the coordinates of the local 
space directions on every processor to exhaust the total lattice volume; thereby 
the usage of any if-statements is completely avoided. Here arises the signif- 
icant advantage of our implementation: the latter type of statements cause 
— especially so on the APE-100 machines — a drastic deterioration of the 
performance by breaking the so-called 'optimization blocks' recognized by the 
compiler. In practice, the contents of the registers is lost each time a branch- 
ing statement like if, do, call subroutine, ... is encountered, because this 
amounts to a break in the memory pipeline pre-loaded before. Writing out the 
27 distinct cases explicitly, however, reduces the impact of the largest part of 
such statements, so that finally we are able to arrive at a substantial speed-up 
of the code0- 



2.2 Performance tests 



After realizing the implementation outlined above, we first have investigated 
the influence of the relaxation parameter o^ssor on the numerical solution pro- 



Of course one can imagine to write down analogously the 81 different cases for 
the familiar periodic situation [19], but then the size of the executable file might 
easily exceed the integer and/or program memory limits of the machine. 



cedure for the linear system in eq. (1). It is quantified through the ratio of the 
numbers of iterations to solve the system with the e/o preconditioned matrix, 
Nc/o, and with the SSOR preconditioned one, A^ssoR) under otherwise identical 
conditions. We show these ratios in dependence on cussor; averaged over a set 
of 0(10) propagator computations occurring in quenched simulations, for two 
lattice sizes with SF boundary conditions in figure 3. Upon distributing these 
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Fig. 3. Improvement factor in the number of BiCGStab iterations, when simulating 
with SSOR and different choices for wssoR instead of e/o preconditioning. The 
lattice sizes are 8^ and 16^ x 32, with parameters [5 = 6.0 and k = 0.1335, 0.1342. 

lattices over the meshes of nodes of the APEs, the lattice extensions per node 
were 4^ x 8 and 2^ x 32. The course of A'^e/o/A'ssoR confirms a conclusion of the 
Wuppertal group [21] that between c^ssor — 1-3 and cjssor — 1-6 the gain in 
the number of iterations needed for the fermion matrix inversions reaches its 
maximum: the corresponding improvement factor is around or even above 2, 
while the tendency for a further increase of A'e/o/A'ssoR towards the chiral limit 
(i.e. larger /t and smaller quark mass) is also seen. Above cussor — 1-6 — 1.8 this 
ratio drops rapidly and the gain gets lost; therefore, we take over u;ssor = 1-4 
to be considered as an 'optimal' compromise irrespective of the definite values 
for lattice sizes and/or parameters. 
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Now we pass to the central question, whether the gain in the number of it- 
erations also translates into a visible CPU time gain. Of course, as already 
pointed out previously, this will depend on the hardware architecture of the 
machine in use as well as on the individual implementation. At the peak in 
the upper diagram of figure 3, for instance, a total saving of 1.5 in the spent 
simulation time can be attained. In table 1 we collect the approximate per- 
formance gain factors in units of iteration number and net CPU time, which 
were found in the situation of a realistic (quenched) QCD simulation in phys- 
ically large volumes. Here we applied SSOR and e/o preconditioning for a 

Table 1 
Approximate gain factors of SSOR over e/o in iteration number and net CPU time 



sublattice/node 


subvolume/node 


iteration gain 


performance gain 


23 x32 


8-32 


2.0 


1.4 


2^ X 4 X 32 


16-32 


2.1 


1.4 


2 X 42 X 32 


32-32 


2.2 


1.5 



set of 300 fermion matrix inversions on thermalized 16^ x 32 configurations 
at /9 = 6.0 and k = 0.1335,0.1342, where the relaxation parameter was set 
to ci^ssoR = 1-4 throughout and the pseudoscalar mass at those couplings is 
amps = 0.388,0.300 [27,28]. Moreover, we examined the dependence of the 
SSOR preconditioner on the fractional grid size per node treated by a single 
processor. As illustrated by the numbers in table 1 and in figure 4, the iter- 
ation number ratio and thus the preconditioning efficiency slightly increases 
with growing volumes of the different local subgrids, 2^ x 32, 2^ x 4 x 32 and 
2 X 4^ X 32, if the 16^ x 32 lattice is spread over the 512, 256 and 128 proces- 
sors of the available machines, respectively. This complies with the heuristic 
expectation that an enlarged number of coloured groups, i.e. sets of points at 
the same fixed position within the local sublattices (according to their locally- 
lexicographic ordering), entails a measurable iteration gain in the inverter, 
whereas the performance stays nearly unchanged owing to less parallelism and 
hence a small accompanying inter-processor communication overhead. Because 
of the equivalence of e/o preconditioning to a colouring into two groups as- 
signed to the even and odd sublattice, one can accommodate the e/o iteration 
number in the upper part of figure 4 too. Additionally, the points in both 
diagrams indicate once more the even better behaviour of BiCGStab-SSOR in 
the range of lighter quark masses. 

The foregoing observations are supported by the large scale simulations in 
the strange quark mass region underlying the extraction of hadron masses 
and matrix elements in quenched QCD with the SF reported in refs. [27,28]. 
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Fig. 4. Upper part: BiCGStab iteration number in dependence of the local lattice 
volume per node after distributing a 16^ x 32 lattice on different three-dimensional 
meshes of processing nodes. Note that the common time extent of the local lat- 
tices, T = 32, has been divided out. The e/o iteration numbers are included for 
comparison, and the lines are only meant to guide the eye. Lower part: Associated 
improvement factors in the number of iterations. 



There, at /9 = 6.1, 6.2 on 24^ x T lattices (with 3^ x T sublattices per node and 
T = 40, 48) and at /3 = 6.45 on a 32^ x 64 lattice (with 4^ x 8 x 64 sublattice per 
node), SSOR enabled to save against e/o always CPU time factors of around 
1.5-1.6. 

Altogether these results clearly reveal that the replacement of e/o by SSOR 
preconditioning to solve the 0(a) improved Wilson-Dirac equation in the SF 
scheme pays off in real simulation costs. In contrast to what the authors in [21] 
obtain, for the same local diagonal block size of Q and computer class, from 
QCD simulations including the clover term with ordinary boundary conditions, 
the SF type of boundary conditions allow for an efficient implementation of 
SSOR preconditioning also on massively parallel machines with an architecture 
equal or similar to that of the APE-Quadrics systems. 
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3 Conclusion 



We have demonstrated in numerical simulations of quenched lattice QCD with 
the Sheikholeslami-Wohlert quark action that the increase of performance be- 
tween even-odd and SSOR preconditioning in a parallel implementation can 
be a factor ~ 1.5, when Schrodinger functional boundary conditions are em- 
ployed. 

Opposed to the more standard situation with periodic boundary conditions 
in all directions studied in ref. [21], the gain factors in real time consumption 
come out to be significantly better here. The main reason for this originates 
in the lower number of cases (27 versus 81) to be distinguished explicitly, 
when the contributions to a given site among the locally-lexicographically 
ordered points of the local sublattices residing on the processors have to be 
collected: the avoidance of any conditional statements in the solver routines 
evades unwanted breaks in the data flow within the registers of the computer, 
which then directly translates into a considerable gain in units of CPU time. 
We have to emphasize that this inherent sensitivity to pipeline optimization 
might be — at least partly — a special feature of the APE-100 environment. 
Nevertheless, since some conclusions drawn from the investigations in [20-22] 
refer to the identical particular machines, our findings should be of interest in 
the same context and can be compared with the results stated there. 

As the Schrodinger functional formulation of QCD is physically already well 
accepted to be a viable framework to address many problems in the non- 
perturbative low energy regime of the theory [25,28], the observed evidence 
for a performance gain of SSOR (together with BiCGStab as the inverter) 
for the 0(a) improved Wilson-Dirac operator in actual run time — also on a 
parallel computer — constitutes a further benefit of this scheme. Therefore, 
the feasibility of an efficient implementation of this preconditioner does not 
only provide an important algorithmic information by itself, but even more is 
also promising for any kind of future applications of the Schrodinger functional 
in lattice QCD. 
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